home *** CD-ROM | disk | FTP | other *** search
- Path: solon.com!not-for-mail
- From: Chris Torek <torek@elf.bsdi.com>
- Newsgroups: comp.lang.c,comp.lang.c.moderated
- Subject: Re: C coding problem
- Date: 8 Apr 1996 22:46:58 -0500
- Organization: Berkeley Software Design, Inc.
- Sender: clc@solutions.solon.com
- Approved: clc@solutions.solon.com
- Message-ID: <4kcmji$p13@solutions.solon.com>
- References: <4j06na$808@solutions.solon.com> <4k1qh3$5hn@solutions.solon.com> <4k5vrk$a2d@solutions.solon.com> <4kbdq6$ec2@solutions.solon.com>
- Reply-To: torek@bsdi.com
- NNTP-Posting-Host: solutions.solon.com
-
- (I have removed a bunch of HP-specific groups.)
-
- Someone wrote:
- >>>In such and environment,
- >>> *q++ = *p++;
- >>>[can be slower than]
- >>> q[i] = p[i], ++i;
-
- In article <4k5vrk$a2d@solutions.solon.com>
- schwarz@mips.complang.tuwien.ac.at (Konrad Schwarz) wrote:
- >>Note that indexed addressing often works only for small powers of
- >>two, e.g., x86 family.
-
- In article <4kbdq6$ec2@solutions.solon.com> Verne Arase
- <VArase@varase.it.luc.edu> wrote:
- >Why only small powers of two? You mean if the size of each element is a
- >small power of two (so you can use shift rather than multiply)?
-
- The reason that it generally works `only for small powers of two'
- for such systems is that the hardware indexed-addressing modes only
- support the hardware's datatypes, which typically use 1, 2, 4, and
- 8 8-bit bytes. This is called `scaled indexed addressing': if p
- is a `long *', and long is 8 bytes, p[i] might use the `8 byte
- scaled index' mode to multiply i by 8 before adding it to p, all in
- one instruction.
-
- (Incidentally, on machines with rapid scaled index addressing and
- `load effective address' instructions, these can often be used to
- synthesize multiplication by values such as 3, 5, and 9 in a single
- instruction. GCC will do this for you when possible. This turns
- out to be useful more often you might expect: calculations of the
- form `x * 10' are common, and 10 is 2*5, and x*5 is `lea x,x[x,4]'.)
-
- >AFAIK, indexed addressing works with just about all CISC computers, from
- >the x86 and 68K line through S/360/370/390.
-
- There are CISCs that lack indexed addressing, and RISCs that have
- it, and on some implementations of some CISC architectures, the
- scaled indexing addressing modes are slower than some alternatives.
- It is not generally possible to determine the fastest method of
- copying memory without a lot more information about the hardware.
- (For instance, the SPARC has indexed addressing, but not scaled
- indexed addressing, so that the index must be a byte offset.)
-
- In article <4k60p9$ahr@solutions.solon.com> falstaff@xs4all.nl disparaged
- code of the form:
- >void loserstrcpy(char p[],char q[])
- >{ int i=0;
- >
- > while((q[i]=p[i])!=0)
- > i++;
- >}
-
- As it happens, however, a literal translation of this code to SPARC
- assembly performs better than a literal translation of the pointer
- version.
-
- Finally, I present here some code I wrote and tested specifically
- for a `kernel bcopy' implementation (analagous to the Standard's
- memcpy---i.e., does not handle overlap---but lacks a return value).
- This is the fastest method I have found yet for the SPARC. Note
- that it uses both indexing and pointer increments, all carefully
- chosen based on the target instruction set. The code is byte-order
- dependent and `tuned' for gcc (and yet still requires some minor
- hand-tuning afterward, as gcc creates some unnecessary pointers
- within the unrolled inner loops). The fancy alignment network at
- the end, while complex, is faster than byte-at-a-time copies, on
- the machines I tested.
-
- (Note that the alignment network can be recoded for little-endian
- machines, even those that do not impose alignment constraints in
- hardware. On the other hand, many of those have specialized
- memory-copy instructions (e.g., VAX, x86), are register-poor (x86),
- have different relative CPU-to-memory bandwidth, or have some or
- all of these and other factors that make it unprofitable. The MIPS
- Rx000, which is either-endian and has hardware alignment constriaints,
- has special LWL and LWR instructions for building the left and right
- parts, so that this should not be attempted directly in C.)
-
- /* prototype bcopy, for turning into assembly... */
-
- void
- bcopy(void *src0, void *dst0, unsigned int len) {
- char *src = src0, *dst = dst0;
- unsigned int n, i;
- unsigned int lp, rp; /* left and right partial words */
-
- i = 0;
- n = (int)src | (int)dst;
-
- /*
- * Handle small copies separately. We *must* handle 6 or fewer
- * bytes here as the fancier code may copy 7 without looking at
- * the length, but we might as well take care of anything under 8.
- */
- if (len < 8) {
- /*
- * Too small to fuss much, but handle 4-aligned-bytes
- * quickly.
- */
- if (len >= 4 && (n & 3) == 0) {
- *(int *)dst = *(int *)src;
- i = 4;
- }
- while (i < len)
- dst[i] = src[i], i++;
- return;
- }
-
- /*
- * We have at least 8 bytes to copy. Handle easy doubleword
- * copies (everything perfectly aligned) first.
- */
- if ((n & 7) == 0)
- goto copy_doublewords;
-
- /*
- * No matter which algorithm we choose, the destination must be
- * int-aligned. If it is not even short-aligned, fix that with
- * a one-byte copy now, then choose an alignment algorithm.
- */
- n = (int)src ^ (int)dst;
- if ((int)dst & 1)
- *dst++ = *src++, len--;
- if ((n & 7) == 0) {
- /*
- * Align to doubleword, then copy doublewords. Move a
- * short and/or an int first, if necessary. This plus the
- * byte we might have copied above means that len had
- * better be at least 7.
- */
- if ((int)dst & 2)
- *(short *)dst = *(short *)src,
- dst += 2, src += 2, len -= 2;
- if ((int)dst & 4)
- *(int *)dst = *(int *)src,
- dst += 4, src += 4, len -= 4;
- copy_doublewords:
- /* Copy via ldd/std, with an unrolled inner loop. */
- n = len / 8 / 8;
- if (n) {
- len -= n * 8 * 8;
- #define CP(k) ((long long *)dst)[k] = ((long long *)src)[k]
- do {
- CP(0); CP(1); CP(2); CP(3);
- CP(4); CP(5); CP(6); CP(7);
- dst += 8 * 8, src += 8 * 8;
- } while (--n > 0);
- #undef CP
- }
-
- /* Handle any trailing `long long's. */
- n = len & ~7;
- while (i < n)
- *(long long *)&dst[i] = *(long long *)&src[i], i += 8;
-
- /* Mop up trailing int, short, and/or byte. */
- if ((n = len - i) == 0)
- return;
- if (n >= 4) {
- *(int *)&dst[i] = *(int *)&src[i], i += 4;
- goto trailing_check;
- }
- goto trailing_123;
- }
-
- if ((n & 3) == 0) {
- /*
- * Align to int, then copy ints. This is just like the
- * doubleword loop, but moves only four bytes at a crack.
- */
- if ((int)dst & 2)
- *(short *)dst = *(short *)src,
- dst += 2, src += 2, len -= 2;
-
- n = len / 4 / 8;
- if (n) {
- len -= n * 4 * 8;
- #define CP(k) ((int *)dst)[k] = ((int *)src)[k]
- do {
- CP(0); CP(1); CP(2); CP(3);
- CP(4); CP(5); CP(6); CP(7);
- dst += 4 * 8, src += 4 * 8;
- } while (--n > 0);
- #undef CP
- }
- n = len & ~3;
- while (i < n)
- *(int *)&dst[i] = *(int *)&src[i], i += 4;
- trailing_check:
- if ((n = len - i) == 0)
- return;
- trailing_123:
- if (n >= 2) {
- /* 2 or 3 bytes left */
- *(short *)&dst[i] = *(short *)&src[i];
- if (n == 2)
- return;
- i += 2;
- }
- dst[i] = src[i];
- return;
- }
-
- /*
- * Cannot even int-align, but still have at least 7 bytes to copy,
- * and dst is at least short-aligned. Copy up to two more bytes
- * to make it int-aligned, leaving at least 5 bytes to go. This
- * will leave src unaligned with a 1, 2, or 3-byte offset.
- *
- * This code is rather complex; see the pictures below for operation.
- */
- if ((int)dst & 3) {
- *dst++ = *src++;
- len--;
- if ((int)dst & 3) {
- *dst++ = *src++;
- len--;
- }
- }
- n = len / 4; /* n >= 1 */
- if (((int)src & 3) == 1) {
- /*
- * src dst
- * +---------------+ +---------------+
- * | | 0 | 1 | 2 | | 0 | 1 | 2 | 3 |
- * +---------------+ +---------------+
- * | 3 | . | . | . | | . | . | . | . |
- * +---------------+ +---------------+
- * | . | w | x | y | | w | x | y | z |
- * +---------------+ +---------------+
- * | z | ? | ? | ? | | ? | ? | ? | |
- * +---------------+ +---------------+
- *
- * (where `.'s repeat as needed and `z' is the last of
- * the fullword-at-dst bytes). The `empty' slots always
- * exist at src, but need not exist at dst, and we may
- * may need to write none, one, two, or all three of the
- * bytes marked `?' (depending on the leftover length).
- */
- /* Pick up the first three bytes, discarding -1st byte. */
- lp = *(int *)&src[-1] << 8, src += 3;
- do {
- /* lp contains 3 `left hand' bytes */
- rp = *(int *)&src[i];
- /* now rp has `right hand' byte in top nibble */
- *(int *)&dst[i] = lp | (rp >> 24);
- lp = rp << 8;
- i += 4;
- } while (--n > 0);
- /* Up to 3 bytes remain; they are available in lp. */
- if ((len -= i) != 0) {
- if (len == 1)
- dst[i] = lp >> 24;
- else if (len == 2)
- *(short *)&dst[i] = lp >> 16;
- else {
- *(short *)&dst[i] = lp >> 16;
- dst[i + 2] = lp >> 8;
- }
- }
- } else if (((int)src & 3) == 2) {
- /*
- * src dst
- * +---------------+ +---------------+
- * | | | 0 | 1 | | 0 | 1 | 2 | 3 |
- * +---------------+ +---------------+
- * | 2 | 3 | . | . | | . | . | . | . |
- * +---------------+ +---------------+
- * | . | . | w | x | | w | x | y | z |
- * +---------------+ +---------------+
- * | y | z | ? | ? | | ? | ? | ? | +
- * +---------------+ +---------------+
- * | ? |
- * +---+
- *
- * Note that this time we may have to fetch one last
- * byte (the third `?') from src when the full-word
- * loop is done.
- */
- /* Pick up the first two bytes. */
- lp = *(unsigned short *)&src[0] << 16, src += 2;
- do {
- /* lp contains 2 `left hand' bytes */
- rp = *(int *)&src[i];
- /* now rp has two `right hand' bytes */
- *(int *)&dst[i] = lp | (rp >> 16);
- lp = rp << 16;
- i += 4;
- } while (--n > 0);
- /* Up to 3 bytes remain, but only 2 are in lp. */
- if ((len -= i) != 0) {
- if (len == 1)
- dst[i] = lp >> 24;
- else if (len == 2)
- *(short *)&dst[i] = lp >> 16;
- else {
- *(short *)&dst[i] = lp >> 16;
- rp = src[i];
- i += 2;
- dst[i] = rp;
- }
- }
- } else /* (((int)src & 3) == 3) */ {
- /*
- * src dst
- * +---------------+ +---------------+
- * | | | | 0 | | 0 | 1 | 2 | 3 |
- * +---------------+ +---------------+
- * | 1 | 2 | 3 | . | | . | . | . | . |
- * +---------------+ +---------------+
- * | . | . | . | w | | w | x | y | z |
- * +---------------+ +---------------+
- * | x | y | z | ? | | ? | ? | ? | |
- * +---------------+ +---------------+
- * | ? | ? |
- * +---+---+
- *
- * Now we may have to read one or two last bytes from src
- * when the full-word loop is done.
- */
- lp = *src++ << 24;
- do {
- rp = *(int *)&src[i];
- /* now rp has three `right hand' bytes */
- *(int *)&dst[i] = lp | (rp >> 8);
- lp = rp << 24;
- i += 4;
- } while (--n > 0);
- /* Up to 3 bytes remain, but only one is in lp. */
- if ((len -= i) != 0) {
- if (len == 1)
- dst[i] = lp >> 24;
- else {
- rp = *(unsigned short *)&src[i];
- lp |= rp << 8;
- *(short *)&dst[i] = lp >> 16;
- if (len == 3) {
- i += 2;
- dst[i] = rp /* lp >> 8 */;
- }
- }
- }
- }
- }
- --
- In-Real-Life: Chris Torek, Berkeley Software Design Inc
- El Cerrito, CA Domain: torek@bsdi.com +1 510 234 3167
-